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Abstract 

A goodness-of-fit test for the fitting of a parametric model to data obtained 
from a detector with finite resolution and limited acceptance is proposed. 
The parameters of the model are found by minimization of a statistic that 
is used for comparing experimental data and simulated reconstructed data. 
Numerical examples are presented to illustrate and validate the fitting pro- 
cedure. 
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1. Introduction 

The probability density function (PDF) P(x') of a reconstructed char- 
acteristic x' of an event obtained from a detector with finite resolution and 
limited acceptance can be represented as 



where p(x) is the true PDF, A(x) is the acceptance of the setup, i.e. the 
probability of recording an event with a characteristic x, and R(x, x') is the 
experimental resolution, i.e. the probability of obtaining x' instead of x after 
the reconstruction of the event. The integration in (TjQ) is carried out over the 
domain Q of the variable x and the domain Q' of the variable x' . 
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There are two ways of fitting a parametric model p(x, a\, a-i, ■ ■ ■ , a{) of the 
true PDF to reconstructed data: 

1. Find an estimate p u {x) of the true PDF p(x) by solving an unfolding 
problem, and then fit a parametric model of the true PDF p(x, a\,a2, 
. . . ,ai) to the unfolded distribution p u (x). 

2. Fit a parametric model of the reconstructed PDF, that is, 

J n p(x,a 1 ,a 2 , ■ ■ ■ ,ai)A(x)R(x,x') dx 
fci> Ja^( x ' Ql > a 2, • • - , ai)A(x)R(x, x') dx dx 1 ' 

directly to the reconstructed data. 

These two possibilities have been discussed in [H, 0]. The acceptance A(x) 
and the resolution function R(x, x') must be defined for both methods. In 
the majority of cases, they cannot be found analytically, and a Monte Carlo 
method is used instead for that purpose. The unfolding (inverse) problem 
is known to be an ill-posed problem and cannot be solved without a priori 
information about the solution. Any solution of this problem has, in addi- 
tion to statistical errors due to the finite statistics of the experimental data, 
also systematic errors related to the use of a priori information. These sys- 
tematic errors have an influence on the choice of the parametric model and 
the estimation of the parameters in the first method. The second method is 
preferable in many cases. 

After discretization of the problem, the authors of [l[ found the accep- 
tance function A(x) and the resolution function R(x, x'), and then used these 
functions to fit the parameters of the true distribution. The main disadvan- 
tage of this approach is that the resolution function R(x,x'), which is a 
matrix after discretization, has rather noisy matrix elements because in real 
cases the size of the Monte Carlo sample of events is of the same order as 
the size of the experimental sample of events. Another source of uncertainty 
is the discretization. Also, the authors of [l[ did not propose a statistic that 
could be used for a goodness-of-fit test. 

In 0|, a reweighting procedure for fitting a Monte Carlo reconstructed 
distribution to the reconstructed data was proposed. The procedure was 
presented rather sketchily, and cannot be repeated even for the example that 
was used in Q for illustration. There is not a clear explanation of how the 
parameters and the errors in them were calculated. The authors of stated, 
without proof, that the statistic used for the fitting of the parameters had 
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a x 2 distribution but did not define the number of degrees of freedom. This 
makes it impossible to use this statistic for choosing the best model from a 
set of alternative parametric models. 

Recently, a test for comparing a weighted histogram jij that is a general- 
ization of the classical chi-square test [I| has been proposed. In this paper, we 
apply the results obtained in ji[ to develop a procedure for direct paramet- 
ric fitting of data obtained from detectors with finite resolution and limited 
acceptance. 

This paper is organized as follows. In Section 2, a method for fitting the 
parameters of the model of the true PDF to the data and a goodness-of-fit 
test are proposed. A statistic for comparing a histogram with unweighted 
entries and a histogram with weights that depend on the parameters is used 
for that purpose. In Section 3, a numerical example that demonstrates how 
one can estimate the parameters and the statistical errors in them practically 
is presented. A numerical experiment with 10 000 runs is described to validate 
the proposed method. 

2. Parametric fitting of Monte Carlo results to data 

We consider the PDF Pi(x') of a reconstructed characteristic of experi- 
mental events and the PDF P2(x') of the corresponding reconstructed char- 
acteristic of the Monte Carlo events for the same detector. 

A histogram with m bins for a given PDF P\{x') is used to estimate the 
probability Pu that a random event belongs in bin i: 



The integration in is carried out over the bin S*-, and ^2™ Pu = 1. The 
histogram can be obtained as the result of a random experiment with the 
PDF P\(x'). We denote the number of random events belonging to the ith 
bin of the histogram by riu- The total number of events in the histogram is 
equal to n\ = YlT=i The quantity Pj = nu/ri\ is an estimator of Pu with 
an expectation value EPh = Pu. 

A histogram of the Monte Carlo reconstructed PDF P^x 1 ) can be ob- 
tained as the result of a random experiment (simulation) that has three 
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steps j^j: 
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1. A random value x is chosen according to a PDF g(x). The function 
g(x) is some expected true (initial) distribution defined in the domain 

n. 

2. We go back to step 1 again with probability 1 — A(x), and to step 3 
with probability A(x). 

3. A random value x' is chosen according to the PDF R(x,x'). 

The events x' are distributed according to the PDF P™(x'), where 

pin/M = j Q g{x)A{x)R{x,x')dx 

In 1 In g( x )A(x)R(x, x') dx dx' 

The quantity PJ? = n-2i/n2, where n2i is the number of events belonging to 
the ith bin for a histogram with total number of events ri2, is an estimator 
of P£, 

P&= l P l 2 n {x')dx\ i = l,...,m, (5) 
Js> 

with the expectation value of the estimator equal to 

EI% = I$. (6) 

It is expected that A(x) and the resolution function R(x, x') for the real setup 
and for the Monte Carlo simulation will be the same. This is achieved by 
adjusting the Monte Carlo simulation program and by a suitable choice of 
the domains Q and Q' of the variables x and x'. 

In experimental particle and nuclear physics, step 3 is the most time- 
consuming step of the Monte Carlo simulation. This step is related to the 
simulation of the process of transport of particles through a medium and the 
rather complex registration apparatus. 

To use the results of the simulation with an initial PDF g(x) to calculate 
a histogram of events distributed according to the PDF Pife') with a true 
PDF p(x), we write the equation for P^i in the form 

f s , f n p(x)A(x)R(x, x') dxdx' r r 

P< H = 1 A— — — — = / / w(x)g(x)A(x)R(x,x')dxdx', 

Jn> J n P{x)A(x)R{x,x')dxdx' Js'Jn 

(7) 

where 

w(x) = p(x)/g(x) / / p(x)A(x)R(x, x') dx dx' (8) 
Jn 1 Jn 
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is the weight function. Because of the condition ^\ P 2i = 1, we shall call the 
weights defined above "normalized weights" from now on, as opposed to the 
unnormalized weights w(x), which are given by w(x) = const ■ w(x). 

The Monte Carlo reconstructed histogram for the PDF P2{x') can be 
obtained using reconstructed events for the PDF P^{x') with weights cal- 
culated according to (jHD- In this way, we avoid step 3 of the simulation 
procedure, which is important in cases where one needs to calculate Monte 
Carlo reconstructed histograms for many different true PDFs. 

We denote the sum of the weights of the events in the ith bin of the 
histogram with normalized weights by 



where n^i is the number of events in bin i and Wi(k) is the weight of the kth 
event in the ith bin. The quantity P 2 j = Wi/n 2 is an estimator of P 2 j with 



A frequently used technique in data analysis is the comparison of a re- 
constructed PDF with a Monte Carlo reconstructed PDF through a com- 
parison of histograms. The hypothesis of homogeneity |H states that the 
two histograms represent random values with identical distributions. This 
is equivalent to assuming that there exist m constants Pi, ■ ■ ■ ,p m such that 
YliLiPi = 1 an d that the probability of belonging to the ith bin for some 
measured value in the experiment and in the Monte Carlo simulation is equal 
to Pi. 

From here onwards, we use the weighted histogram with unnormalized 
weights and Wi denote the sum of the weights of the events in bin i. This 
is convenient because the calculation of normalization factors is quite prob- 
lematic in many practical cases. 

We introduce the statistics I3J 




(9) 
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expectation value EP ; 



P-n- 
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where 
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tfc(fc)/E ( 12 ) 



and 

k=l k=l 

with the sums extending over all bins i except one bin k. In these equations, 
the probabilities Pi are unknown, and estimators of pi can be found by mini- 
mization of ffTUj) . We denote by X\ the value of X\ after substitution of the 
estimators pi into f lTUj) . As shown in [ij], the statistic 

X 2 = Med{XlXl...,X 2 m } (13) 

approximately has a Xm~2 distribution if the hypothesis of homogeneity is 
valid. 

We substitute the PDF p(x) by the parametric formula p(x, a±, a 2 , . . . , <fy); 
the weights of the Monte Carlo events and the statistic X 2 (a>i, a 2 , ■ ■ ■ , a{) 
are then dependent on the parameters. The estimators 0,1,0,%, ... ,ai of the 
parameters ai, a% . . . , ai can be found by minimization of this statistic. The 
statistic X 2 (di, 02, ■ ■ ■ ,a{) has a Xm-2-1 distribution if the parametric model 
fits the data, because / parameters are estimated. It can be used for a 
goodness-of-flt test for selection of the best model from a set of alternative 
models. 

3. Numerical examples 

In this section two examples are presented to illustrate and validate the 
fitting procedure. 

3.1. Example 1 

We took the true PDF, as in I2|, to be of the form 



p(x) = (l + aO/1.5, (14) 

defined on the interval [0,1]. The reconstructed PDF was defined as 

= J n p(x)A(x)R(x,x')dx ^ 
fci' JqP( x )A(x)R(x, x') dxdx r 

with an acceptance function A(x) = 1 and a resolution function of the form 

RM = ^-M~ (l iP)' (16) 
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with er = 0.3. The domain of the variable x was taken as Q = [0, 1] and that 
of x' as Q' = [—0.3, 1.3]. A simulation of events with a PDF P\{x') was done 
according to the algorithm described in the previous section. 

For the Monte Carlo reconstructed PDF, we used the initial PDF g(x) = 1 
and chose the parametrization for the true PDF in the form 

p(x, a) oc 1 + ax. (17) 

The Monte Carlo reconstructed PDF was defined as 

= J n p(x,a)A(x)R(x,x')dx 
In' In p( x > a )A(x)R(x, x') dx dx' 

Monte Carlo events were simulated according to the algorithm described 
in the previous section with g(x) as the initial PDF. Weights of events were 
calculated according to the formula w(x) = I — ax. Reconstructed and Monte 
Carlo reconstructed samples were simulated by generating 5 • 10 2 , 5 • 10 3 , and 
5- 10 4 events in the first step. We chose 5-bin and 20-bin histograms and used 
pairs of histograms with various numbers of events in the fitting procedure. 
10 000 simulation runs were done for each case. To investigate the fitting 
procedure, the following quantities were calculated: 

• Average values a = ^1°°° d(i)/10 000 of the estimated parameter, 
where i is the run number. 

• Average statistical errors A of the estimated parameter. For this pur- 
pose, the various realizations of the estimator a were ordered, and then 
the positive error was defined as the minimal interval with lower bound 
a that contained 34.1345% of the realizations of a and the negative error 
was defined as the minimal interval with upper bound a that contained 
34.1345%. 

• The real sizes of the test a s for a nominal test size a = 5% were 
estimated as the fraction of runs that had a p- value lower than 5%. 

The program MINUIT ^ was used for the minimization of X 2 (a) and for 
error analysis. 

The results of this calculation are presented in Table 1. We may notice 
that the estimators are biased in the cases where at least one histogram is the 
result of a simulation of 5 • 10 2 events. The bias is lower for 20-bin than for 
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5-bin histograms. The errors are asymmetric, and the asymmetry is reduced 
if the statistics of the generated events are reduced. The sizes of the tests a s 
are close to the nominal value of a = 5%; see jif for details of the method of 
comparison. 

In the right part of Table 1, we present results of calculations for the 
case where "er = 0", or R(x,x') = 5(x — x'), which helps us to understand 
the effect of the resolution function. The results of calculations for "infinite" 
statistics of the Monte Carlo simulation are also presented. In this case, 
the data histograms were fitted by probabilities Pi(a) that were calculated 
analytically: 

f bi+1/m , , , . f\ . l + abi + a/2m , . 

Pi\ a ) — i [\ + ax) dx I (1 + ax) dx = . , (19) 

J h J m + ma/2 

where hi is the lower bound of bin i. The statistic 

(n u - ni^(a)) 2 



i=i 



niPi(a) 



(20) 



was used to estimate the parameters and for goodness-of-fit tests. This case 
represents the best result that can be achieved for given statistics of the 
data, and is useful for comparison. The results presented in Table 1 show 
that a deterioration of the resolution leads to an increase in the statistical 
error of a and also a bias. We observe an asymmetry in the errors even in 
the case of an "infinite" Monte Carlo simulation. Note that the statistics 
(120]) for the estimated values of the parameters a have a Xm-2 distribution 
if the experimental histogram is the result of a random experiment with 
probabilities Pi(a), i — 1, . . . , m Q. 

For the purposes of illustration, we present the results of a parametric fit 
of the Monte Carlo results to the data for one of the cases described above. 
The numbers of generated events for the data and the Monte Carlo simulation 
were taken equal to 5 • 10 3 , and we used histograms with 20 bins. The result 
of fitting with MINUIT was a = l.ll±g;|§, with X 2 (d) = 11.72 and the p- 
value equal to 0.82. A comparison of the histograms of the true PDF with 
the weighted histogram of the Monte Carlo true PDF gave X 2 (d) = 18.03, 
and the p-value was equal to 0.45. Figure la shows the histograms of the 
reconstructed PDF and of the Monte Carlo reconstructed PDF, calculated 
with the weights of the events equal to 1 + 1.1 lx. Figure lb shows the 
histograms of the true PDF and of the Monte Carlo true PDF. 
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For convenience of visual comparison, the Monte Carlo histograms have been 
divided by a factor 1 — a/2 = 1 — 1.11/2 in both figures. 



3.2. Example 2 

We took the true PDF to be of the form 

/ x 2 1 
P{X) OC rr h — (21) 

Fy ! (x- 10) 2 + 1 (x - 14) 2 + 1 K J 

defined on the interval [4,16]. The reconstructed PDF P\{x') was defined 
according f fl~5|) with an acceptance function A{x) = 1 — (x — 10) 2 /36 and 
a resolution function of the form defined by formula ( fl~6l) with a = 1.5. 
The domain of the variable x was taken as Q = [4, 16] and that of x' as 
Q' = [4, 16]. A simulation of events with a PDF P±(x') was done according 
to the algorithm described in the previous section. 

For the Monte Carlo reconstructed PDF, we used the initial PDF 

3 6 4 

g(x) oc g(x) = - h r- (22) 

yK 1 yK ' (x- 9) 2 + 2.25 (x- 13) 2 + 4 v ; 

and chose the parametrization for the true PDF in the form 

p{x, a l7 a 2 , a 3 ) oc p{x, a u a 2 , a 3 ) = _ ^ - ^ + ^ _ fl ^ 2 + f " ( 23 ) 

The Monte Carlo reconstructed PDF P-z{x') was defined according formula 
(118j) . Monte Carlo events were simulated according to the algorithm de- 
scribed in the previous section with g(x) as the initial PDF. Weights of 
events were calculated according to the formula w(x) = g(x)/p(x,ai,a 2 ,a 3 ). 
Reconstructed and Monte Carlo reconstructed samples were simulated by 
generating 5 ■ 10 3 , and 2 • 10 4 events in the first step. We chose 40-bins 
histogram. 

The result of fitting with MINUET was a± = 1.87+°;^, a 2 = 9.86±g;gf and 
S3 = 14.05lg;2o , with X 2 (cii, a 2 , S3) = 32.00 and the p- value equal to 0.70. A 
comparison of the histograms of the true PDF with the weighted histogram 
of the Monte Carlo true PDF gave X 2 (d\, a 2 , S3) = 46.85, and the p-value 
was equal to 0.15. Figure 2a shows the histograms of the reconstructed PDF 
and of the Monte Carlo reconstructed PDF, calculated with the weights of 
the events equal to g(x)/p(x, Si, a 2 , S3). Figure 2b shows the histograms of 
the true PDF and of the Monte Carlo true PDF. For convenience of vi- 
sual comparison, the Monte Carlo histograms have been divided by a factor 
4 J 4 16 p(x, Si, S2, a,s)dx in both figures. 
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Figure 1: Example 1. Results of parametric fit of Monte Carlo results to data: (a) 
histograms of reconstructed PDF and Monte Carlo reconstructed PDF; (b) histograms 
of true PDF and Monte Carlo true PDF. 
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Figure 2: Example 2. Results of parametric fit of Monte Carlo results to data: (a) 
histograms of reconstructed PDF and Monte Carlo reconstructed PDF; (b) histograms 
of true PDF and Monte Carlo true PDF. 
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Table 1: Example 1. Mean values a of parameter estimates d, mean values A of errors of parameter estimates d, and sizes of 
test a s for a nominal size of test a = 5%. Calculations were done for histograms of the reconstructed PDF and Monte Carlo 
reconstructed PDF with various numbers of generated events nda and n mc for numbers of bins to = 5 and to = 20. The left part 
of the table presents calculations for a resolution function with a = 0.3, and the right part for a = 0. 



a = 0.3 cr = 









TO = 5 




mc 


to = 20 


n 


rnc 




TO = 5 








m = 20 










nda 




5-10 2 


5-10 3 


5-10 4 


5-10 2 


5-10 3 


5-10 4 




5-10 2 


5-10 3 


5-10 4 


oo 


5-10 2 


5-10 3 


5-10 4 


oo 






a 


1.29 


1.16 


1.13 


1.17 


1.07 


1.07 




1.08 


1.05 


1.04 


1.04 


1.04 


1.02 


1.01 


1.01 


5 


10 2 


A 


+3.13 


+1.16 


+0.81 


+ 1.84 


+0.85 


+0.79 




+0.68 


+0.48 


+0.44 


+0.43 


+0.62 


+0.44 


+0.42 


+0.40 








-0.66 


-0.52 


-0.54 


-0.61 


-0.46 


-0.45 




-0.11 


-0.34 


-0.33 


-0.29 


-0.39 


-0.30 


-0.28 


-0.29 






a s 


5.2% 


6.1% 


6.3% 


4.8% 


5.2% 


5.3% 




6.0% 


6.0% 


6.4% 


5.0% 


5.1% 


5.6% 


5.4% 


4.7% 






a 


1.12 


1.02 


1.01 


1.11 


1.02 


1.00 




1.03 


1.01 


1.00 


1.00 


1.02 


1.01 


1.00 


1.00 


5 


10 3 




+0.93 


+0.30 


+0.23 


+0.91 


+0.28 


+0.20 




+0.40 


+0.17 


+0.13 


+0.12 


+0.38 


+0.16 


+0.12 


+0.11 






A 


-0.52 


-0.22 


-0.17 


-0.47 


-0.22 


-0.16 




-0.34 


-0.15 


-0.11 


-0.10 


-0.31 


-0.14 


-0.10 


-0.10 






U s 


6.1% 


5.8% 


5.5% 


5.3% 


5.2% 


5.8% 




6.0% 


5.9% 


5.8% 


4.8% 


5.5% 


5.6% 


5.6% 


4.6% 






a 


1.10 


1.01 


1.00 


1.10 


1.01 


1.00 




1.04 


1.00 


1.00 


1.00 


1.02 


1.00 


1.00 


1.00 


■5 


10 4 




+0.87 


+0.22 


+0.09 


+0.83 


+0.20 


+0.08 




+0.40 


+0.12 


+0.05 


+0.03 


+0.36 


+0.11 


+0.05 


+0.03 






A 


-0.49 


-0.17 


-0.08 


-0.45 


-0.16 


-0.07 




-0.32 


-0.11 


-0.05 


-0.03 


-0.32 


-0.11 


-0.05 


-0.03 






a s 


5.4% 


6.0% 


5.7% 


5.4% 


5.7% 


5.6% 




5.7% 


5.5% 


5.8% 


4.9% 


5.5% 


6.0% 


5.6% 


5.3% 



4. Conclusions 



A method of fitting a parametric model to data measured with a de- 
tector with finite resolution and limited acceptance has been developed. It 
was developed as an application of a test for comparing histograms with 
unweighted entries and histograms with unnormalized weights proposed in 
previous work by the present author. The method demonstrates a new ap- 
proach to the direct parametric fitting of experimental data that permits one 
to decrease the systematic errors in the estimated parameters. It is a rather 
flexible tool for data analysis that can be used with multidimensional data, 
and does not have any restrictions on the configuration of the bins or the 
domain of the variables investigated. A goodness-of-fit test has been pro- 
posed that can be used for selection of the best parametric model from a set 
of alternative models for describing the data. An evaluation of the method 
has been done numerically for histograms with various numbers of bins and 
numbers of events. A numerical examples have been given to demonstrate 
the use of the method in practice. 
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